ABSTRACT

Statistical methods are increasingly being applied in the analysis of
climatological data. A brief introduction to the subset selection approach
in multiple decision theory is given to illustrate the potential
applications in climatology.

1.  Introduction. 

The  need  for  statistical  methodology  in  analyzing  data  that  arise  in 
meteorology and climatology has long been recognized. Satisfactory statistical
models  have  been  found  to  describe  data  relating  to  precipitation;  see  for 
example,  Crutcher  (1968)  and  Mielke  (1973).  Time  series  data  occur  commonly 
in  climatological  studies.  Some  of  the  important  and  interesting  problems 
arise  in  connection  with  weather  modification  experiments,  objective  weather 
forecasting and classification of meteorological patterns. Some relevant
references are Braham (1979), Bradley, Srivastava and Lanzdorf (1979), Lund
(1971), McCutchan and Schroeder (1973), Mielke (1979), Neyman (1977, 1979),
and Neyman, Scott and Wells (1969) (see also the bibliography by Hanson et
al.). In the present paper, our main interest is in two types of
problems.  The  first  deals  with  comparisons  of  sites  (weather  stations)  based 
on  appropriate  characteristics  of  weather  data.  For  example,  we  may  compare 
these  locations  on  the  basis  of  mean  annual  temperature  or  the  variability  of 
temperature during the year. The second problem relates to selection of the best
predictor variables in the regression model for prediction. We discuss the
ranking and selection formulation of these multiple decision problems.

Section  2  deals  with  the  basic  formulations  of  the  ranking  and  selection 
problems.  Some  specific  subset  selection  procedures  are  briefly  described 
in  Section  3.  These  deal  with  selection  from  normal  populations  in  terms 
of  the  means,  from  gamma  populations  in  terms  of  the  scale  parameters, 
and  from  multivariate  normal  populations  in  terms  of  multiple  correlation 
coefficients. Section 4 is concerned with the selection of the best
set of predictor variables in a regression model.


2.  Ranking  and  Selection  Theory  -  Basic  Formulations. 

To describe the formulation of ranking and selection problems, let us
consider k independent populations $\pi_1, \ldots, \pi_k$, where $\pi_i$ is
characterized by the distribution function $F(x, \theta_i)$ and $\theta_i$
is a parameter which represents the 'worth' of the population. For example,
$\theta_i$ may be the weather characteristic of the ith location. Let
$\theta_{[1]} \le \cdots \le \theta_{[k]}$ denote the ordered $\theta_i$.
To be specific, let us say $\pi_i$ is preferable to $\pi_j$ if
$\theta_i > \theta_j$, so that the best population is the one associated
with the largest $\theta_i$.

Ranking and selection problems have been generally formulated using either
the indifference zone approach or the subset selection approach. Under the
indifference zone formulation of Bechhofer (1954), we want a procedure R
which will select the best population with a minimum guaranteed probability
P* (1/k < P* < 1) whenever $\theta_{[k]} - \theta_{[k-1]} \ge \delta^*$,
where $\delta^* > 0$ and P* are specified in advance. The problem is to
determine the minimum sample size needed to meet this requirement.

In the subset selection approach, our goal is to select a non-empty
subset  of  the  k  populations  so  that  the  best  population  is  included  in  the 
selected  subset  with  a  minimum  guaranteed  probability  P*.  Selection  of 
any  subset  which  includes  the  best  population  is  called  a  correct  selection 
(CS). The general approach is to evaluate the infimum of P(CS|R), the
probability of a correct selection using the procedure R, over the
parameter space
$\Omega = \{\underline{\theta} : \underline{\theta} = (\theta_1, \ldots, \theta_k)\}$
and obtain the constants involved in defining R so that

(2.1)  $\inf_{\Omega} P(CS \mid R) \ge P^*.$

The condition (2.1) is referred to as the P*-condition or the basic
probability requirement. In order to meet this requirement, one determines
the parametric configuration $\underline{\theta}$ for which the infimum in
(2.1) is attained. Such a configuration is called a least favorable
configuration (LFC). In general, there may not be a unique LFC.
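
As a rough numerical illustration of (2.1), the sketch below estimates
P(CS|R) by simulation for the normal-means rule of Section 3.1, taking the
standard error $\sigma/\sqrt{n}$ of each sample mean to be 1. The function
name estimate_pcs, the trial count, the seed, the constant d = 1 and the two
parameter configurations are illustrative assumptions, not part of the
paper. Under these assumptions the estimate for the equal-means
configuration should come out below the estimate for a configuration in
which the best population stands apart, in line with the LFC stated for
that rule.

```python
import random


def estimate_pcs(mus, d, trials=20_000, seed=1):
    """Monte Carlo estimate of P(CS | R) for the rule
    'select population i iff xbar_i >= max(xbar) - d',
    where each sample mean xbar_i is N(mu_i, 1) (assumed unit standard error)."""
    rng = random.Random(seed)
    # Index of the best population (largest mu); ties go to the first one.
    best = max(range(len(mus)), key=lambda i: mus[i])
    hits = 0
    for _ in range(trials):
        xbars = [rng.gauss(mu, 1.0) for mu in mus]
        # Correct selection: the best population is in the selected subset.
        if xbars[best] >= max(xbars) - d:
            hits += 1
    return hits / trials


# Hypothetical configurations: all means equal vs. one mean set apart.
pcs_equal = estimate_pcs([0.0, 0.0, 0.0], d=1.0)
pcs_spread = estimate_pcs([0.0, 0.0, 2.0], d=1.0)
print(pcs_equal, pcs_spread)
```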

For  an  extensive  survey  and  bibliography  of  ranking  and  selection 
theory  and  related  topics  the  reader  is  referred  to  the  recent  book  of  the 
authors  (1979).  Other  books  in  this  area  are  Bechhofer,  Kiefer  and  Sobel 
(1968),  and  Gibbons,  Olkin  and  Sobel  (1977). 


3.  Some  Subset  Selection  Procedures. 

In  this  section,  we  discuss  briefly  subset  selection  procedures  for 
normal  populations  in  terms  of  means,  for  gamma  populations  in  terms  of 
the scale parameter, and for multivariate normal populations in terms of
multiple correlation coefficients. These provide procedures that are
applicable in a number of typical cases.


3.1 Normal Populations. Let $\pi_1, \ldots, \pi_k$ be k independent normal
populations with unknown means $\mu_1, \ldots, \mu_k$, respectively, and a
common variance $\sigma^2$. Let $\bar{X}_i$, i = 1, ..., k, be the sample
means based on samples of size n. The best population is the one associated
with the largest $\mu_i$. When $\sigma^2$ is known, the procedure $R_1$
proposed by Gupta (1956) selects the population $\pi_i$ if and only if

(3.1)  $\bar{X}_i \ge \max(\bar{X}_1, \ldots, \bar{X}_k) - d_1 \sigma / \sqrt{n}$

where $d_1 = d_1(k, P^*) > 0$ is the smallest constant such that the
condition (2.1) is satisfied. The LFC is given by $\mu_1 = \cdots = \mu_k$.
This implies that $d_1$ is given by

(3.2)  $\int_{-\infty}^{\infty} \Phi^{k-1}(x + d_1)\, \varphi(x)\, dx = P^*,$

where $\Phi(x)$ and $\varphi(x)$ are the standard normal cdf and density,
respectively.

The values of $d_1$ are tabulated for several values of k and P* by Gupta
(1963) and Gupta, Nagel and Panchapakesan (1973).
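
Where the tables are not at hand, the constant in (3.2) can be approximated
numerically. The following is a minimal sketch, assuming Simpson's rule on
the interval [-8, 8] and bisection on $d_1$; the function names
(pcs_normal_lfc, solve_d1, select_subset) and the station means in the
usage lines are hypothetical and serve only as illustration. For k = 2 the
integral reduces to $\Phi(d_1/\sqrt{2})$, which gives a simple check on the
output.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal: cdf plays the role of Phi, pdf of phi


def pcs_normal_lfc(d, k, lo=-8.0, hi=8.0, steps=2000):
    """Left side of (3.2): integral of Phi(x + d)^(k-1) * phi(x) dx,
    approximated by Simpson's rule (steps must be even)."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps + 1):
        x = lo + j * h
        weight = 1 if j in (0, steps) else (4 if j % 2 else 2)
        total += weight * _STD.cdf(x + d) ** (k - 1) * _STD.pdf(x)
    return total * h / 3.0


def solve_d1(k, p_star, tol=1e-8):
    """Approximate the smallest d with pcs_normal_lfc(d, k) >= p_star by bisection.
    The integral increases with d, from 1/k at d = 0 towards 1."""
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pcs_normal_lfc(mid, k) < p_star:
            lo = mid
        else:
            hi = mid
    return hi


def select_subset(means, sigma, n, d1):
    """Rule (3.1): keep population i when its sample mean is within
    d1 * sigma / sqrt(n) of the largest sample mean. Returns 0-based indices."""
    cutoff = max(means) - d1 * sigma / math.sqrt(n)
    return [i for i, m in enumerate(means) if m >= cutoff]


# Hypothetical mean annual temperatures at four stations, sigma assumed known.
d1 = solve_d1(k=4, p_star=0.90)
print(d1, select_subset([11.2, 12.9, 12.4, 10.1], sigma=1.5, n=30, d1=d1))
```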

When $\sigma^2$ is not known, the procedure $R_2$ of Gupta (1956) is the
same as $R_1$ with $\sigma$ replaced by s, where $s^2$ is the usual pooled
estimator of $\sigma^2$ based on $\nu = k(n-1)$ degrees of freedom. Here
again, the LFC is given by $\mu_1 = \cdots = \mu_k$. The values of the
constant $d_2$ (used in the place of $d_1$) are tabulated by Gupta and Sobel
(1957) for selected values of k, $\nu$, and P*.
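
The mechanics of $R_2$ can be sketched as follows, assuming k samples of
equal size n held as plain lists. The function names pooled_sd and
select_subset_r2 and the data are hypothetical, and the value passed as d2
in the usage line is a placeholder rather than a tabulated constant; in
practice $d_2$ would be read from the tables for the given k, $\nu$ and P*.

```python
import math


def pooled_sd(samples):
    """Pooled standard deviation s on v = k(n-1) degrees of freedom,
    for k samples of common size n."""
    k, n = len(samples), len(samples[0])
    ss = 0.0
    for sample in samples:
        mean = sum(sample) / n
        ss += sum((x - mean) ** 2 for x in sample)
    return math.sqrt(ss / (k * (n - 1)))


def select_subset_r2(samples, d2):
    """Rule (3.1) with sigma replaced by the pooled s and d1 by d2.
    Returns the 0-based indices of the selected populations."""
    n = len(samples[0])
    means = [sum(sample) / n for sample in samples]
    cutoff = max(means) - d2 * pooled_sd(samples) / math.sqrt(n)
    return [i for i, m in enumerate(means) if m >= cutoff]


# Hypothetical data for three stations; d2 = 2.0 is a placeholder value.
data = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [10.0, 11.0, 12.0]]
print(pooled_sd(data), select_subset_r2(data, d2=2.0))
```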

The procedures $R_1$ and $R_2$ can be modified for the case in which the
population with the smallest $\mu_i$ is defined to be the best. For
procedures involving unequal sample sizes, see Gupta and Huang (1976), and
Gupta and Wong (1976).

3.2 Gamma Populations. Let $\pi_i$ have the associated density

(3.3)  $f(x; \theta_i) = \dfrac{x^{r-1}}{\Gamma(r)\, \theta_i^{r}} \exp(-x/\theta_i)$ for $x > 0$, $\theta_i > 0$, and $f(x; \theta_i) = 0$ otherwise.

As we can see, it is assumed that the populations have the same shape
parameter r (> 0). Further, r is assumed to be known. Our interest is in
selecting the population associated with the largest (smallest) $\theta_i$. The
gamma  distribution  not  only  serves  as  a  model  for  certain  types  of 
measurement,  but  also  includes  the  case  where  the  observations  come  from 
normal  populations  and  the  interest  is  in  selecting  the  population 
associated  with  the  smallest  variance. 

For selecting the population associated with the largest $\theta_i$, Gupta
(1963) investigated the procedure $R_3$ which selects $\pi_i$ if and only if

(3.4)  $\bar{X}_i \ge b \max(\bar{X}_1, \ldots, \bar{X}_k)$

where $\bar{X}_1, \ldots, \bar{X}_k$ are means based on samples of equal
size n, and the constant b (0 < b < 1) is chosen so that the P*-condition
is met. Gupta (1963) has shown that $P(CS \mid R_3)$ is minimized when
$\theta_1 = \cdots = \theta_k$ and that the constant b is given by

(3.5)  $\int_0^{\infty} G_\nu^{k-1}(x/b)\, g_\nu(x)\, dx = P^*,$

where $G_\nu(x)$ and $g_\nu(x)$ are the cdf and density of a standardized
gamma random variable (i.e. with $\theta = 1$) with parameter $\nu/2$ where
$\nu = 2nr$. Thus the constant b depends on n and r only through $\nu$ and
its values are tabulated by Gupta (1963) for selected values of k, P*, and
$\nu$.

For selecting the normal population with the smallest variance, an
analogous procedure is given by Gupta and Sobel (1962a) and the appropriate
constant can be obtained from the tables in their companion paper (1962b).
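
Equation (3.5) can also be solved numerically. The sketch below is one way
to do so under a simplifying assumption that is not in the paper: the shape
$\nu/2 = nr$ is taken to be a moderate positive integer, so that the gamma
cdf has a closed form (the Erlang case) and only the standard library is
needed. The function names, the integration cut-off and the grid size are
illustrative choices; for P* very close to 1 the root b is small and a
finer grid would be advisable. With k = 2 and nr = 1 the integral equals
1/(1 + b), which offers a quick check.

```python
import math


def erlang_cdf(x, a):
    """cdf of a gamma variable with integer shape a and unit scale."""
    if x <= 0.0:
        return 0.0
    if x > 50.0 * a + 50.0:
        return 1.0  # far in the upper tail; avoids overflow in the sum below
    term, total = 1.0, 1.0  # running x^j / j! and its partial sum
    for j in range(1, a):
        term *= x / j
        total += term
    return 1.0 - math.exp(-x) * total


def erlang_pdf(x, a):
    """Density x^(a-1) e^(-x) / (a-1)! for x > 0; returns 0 for x <= 0."""
    if x <= 0.0:
        return 0.0
    return math.exp((a - 1) * math.log(x) - x - math.lgamma(a))


def pcs_gamma_lfc(b, k, a, steps=4000):
    """Left side of (3.5) with shape a = nr, by Simpson's rule on [0, upper]."""
    upper = a + 12.0 * math.sqrt(a) + 30.0  # generous cut-off for the density
    h = upper / steps
    total = 0.0
    for j in range(steps + 1):
        x = j * h
        weight = 1 if j in (0, steps) else (4 if j % 2 else 2)
        total += weight * erlang_cdf(x / b, a) ** (k - 1) * erlang_pdf(x, a)
    return total * h / 3.0


def solve_b(k, p_star, n, r, tol=1e-6):
    """Largest b in (0, 1) meeting the P*-condition, found by bisection.
    The integral decreases in b, from about 1 near b = 0 to 1/k at b = 1."""
    a = n * r
    if a != int(a) or a < 1:
        raise ValueError("this sketch assumes n * r is a positive integer")
    a = int(a)
    lo, hi = 1e-3, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pcs_gamma_lfc(mid, k, a) >= p_star:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical setting: k = 3 populations, n = 5 observations each, shape r = 2.
print(solve_b(k=3, p_star=0.90, n=5, r=2))
```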

3.3 Multivariate Normal Populations. Let $\pi_1, \ldots, \pi_k$ be k
independent p-variate normal populations where $\pi_i$ is
$N(\underline{\mu}_i, \Sigma_i)$. Let
$\underline{X}_i' = (X_{i1}, X_{i2}, \ldots, X_{ip})$ be a random
observation vector from $\pi_i$, i = 1, ..., k. The populations are
ranked in terms of the $\rho_i$, where $\rho_i$ is the multiple correlation
coefficient of $X_{i1}$ with respect to the set $(X_{i2}, \ldots, X_{ip})$.
We are interested in selecting a subset containing the population
associated with the largest $\rho_i$. Let $R_i$ denote the sample multiple
correlation coefficient between $X_{i1}$ and $(X_{i2}, \ldots, X_{ip})$.
Two cases arise: (i) the case in which $X_{i2}, \ldots, X_{ip}$ are fixed,
called the conditional case; (ii) the case in which
$X_{i2}, \ldots, X_{ip}$ are random, called the unconditional case. In
either case, Gupta and Panchapakesan (1969) proposed and studied the rule R
which selects $\pi_i$ if and only if

(3.6)  $R_i^{*2} \ge c \max_{1 \le j \le k} R_j^{*2}$

where $R_i^{*2} = R_i^2/(1 - R_i^2)$, and 0 < c = c(k, P*, p, n) < 1 is
chosen to satisfy the P*-requirement. In this case, the infimum of PCS is
attained when $\rho_1 = \rho_2 = \cdots = \rho_k = 0$ and the appropriate
constant c is given by

(3.7)  $\int_0^{\infty} F_{2q,2m}^{k-1}(x/c)\, f_{2q,2m}(x)\, dx = P^*,$

where $q = \frac{1}{2}(p-1)$, $m = \frac{1}{2}(n-p)$; $F_{r,s}$ denotes the
cdf of an F random variable with r and s degrees of freedom, and $f_{r,s}$
denotes the corresponding density. The values of c are tabulated by Gupta
and Panchapakesan (1969) for selected values of k, m, q, and P*.
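
The left side of (3.7) is the probability that one F variate is at least c
times each of k - 1 other independent F variates with the same degrees of
freedom. Equivalently, c is the (1 - P*) quantile of the ratio of one such
variate to the largest of the others. The sketch below uses this reading
to approximate c by simulation rather than by quadrature; the function
names, the number of draws and the seed are illustrative assumptions, and
the result carries Monte Carlo error, so it is a rough substitute for the
published tables and not a replacement for them.

```python
import random


def f_variate(rng, q, m):
    """One draw from the F distribution with 2q and 2m degrees of freedom,
    built as a ratio of scaled unit-scale gamma variates."""
    return (rng.gammavariate(q, 1.0) / q) / (rng.gammavariate(m, 1.0) / m)


def estimate_c(k, p_star, p, n, draws=50_000, seed=0):
    """Monte Carlo approximation of the constant c in (3.7)."""
    rng = random.Random(seed)
    q, m = (p - 1) / 2.0, (n - p) / 2.0
    ratios = []
    for _ in range(draws):
        best = f_variate(rng, q, m)
        rival = max(f_variate(rng, q, m) for _ in range(k - 1))
        ratios.append(best / rival)
    ratios.sort()
    # About a fraction p_star of the simulated ratios lie at or above this value.
    return ratios[int((1.0 - p_star) * draws)]


# Hypothetical setting: k = 3 populations, p = 4 variables, n = 20 observations.
print(estimate_c(k=3, p_star=0.90, p=4, n=20))
```

For k = 2 and q = m = 1 the integral in (3.7) has the closed form
$c \ln c/(1-c)^2 + 1/(1-c)$, which can be used to check the simulated value.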

4.  Selection  of  Best  Predictor  Variables. 

Many  examples  of  statistical  prediction  schemes  in  climatology  are 
available. The prediction is based on a number of predictor variables.
While the prediction can be made more accurate by bringing in as many
relevant predictor variables as possible, some of them may be highly related
among themselves. The problem of selecting the best set of predictor
variables arises in different contexts. Stringer (1972, pp. 132-133) has
cited examples from the literature relating to prediction of precipitation
and  visibility.  Several  criteria  for  defining  the  best  set  of  predictor 
variables  and  various  techniques  for  selecting  the  best  set  have  been 
discussed  in  a  nice  expository  paper  by  Hocking  (1976).  Also,  a  brief 
review  and  evaluation  of  significant  methods  have  been  given  by  Thompson 
(1978).  However,  the  techniques  described  by  these  authors  are  not 
designed  to  find  a  best  set  of  variables  with  a  guaranteed  level  of 
probability.  Recently,  this  problem  has  been  investigated  by  Arvesen  and 
McCabe  (1973,  1975)  and  Gupta  and  Huang  (1977)  under  the  subset  selection 
formulation  described  earlier  which  includes  a  guaranteed  probability  of  a 
correct  selection.  Investigations  along  these  lines  continue  to  be  of 
interest in view of their practical importance.

